'BonTen' - Corpus Concordance System for 'NINJAL Web Japanese Corpus'

نویسندگان

Masayuki Asahara

Kazuya Kawahara

Yuya Takei

Hideto Masuoka

Yasuko Ohba

Yuki Torii

Toru Morii

Yuki Tanaka

Kikuo Maekawa

Sachi Kato

Hikari Konishi

چکیده

The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising 25 billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents a corpus concordance system named ‘BonTen’, which enables a ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms

In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...

متن کامل

Utilizing lexical data from a web-derived corpus to expand productive collocation knowledge

Collocations are of great importance for second language learners, and a learner’s knowledge of them plays a key role in producing language fluently (Nation, 2001: 323). In this article we describe and evaluate an innovative system that uses a Web-derived corpus and digital library software to produce a vast concordance and present it in a way that helps students use collocations more effective...

متن کامل

A Web Corpus and Word Sketches for Japanese

Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processi...

متن کامل

Corpus Linguistics with BNCweb - a Practical Guide

Book synopsis This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...

متن کامل

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

'BonTen' - Corpus Concordance System for 'NINJAL Web Japanese Corpus'

نویسندگان

چکیده

منابع مشابه

Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms

Utilizing lexical data from a web-derived corpus to expand productive collocation knowledge

A Web Corpus and Word Sketches for Japanese

Corpus Linguistics with BNCweb - a Practical Guide

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

عنوان ژورنال:

اشتراک گذاری